Financial Contribution to Presidential Campaigns in California

Heshuang Zeng 07/01/2016


Summary

This report analyzes patterns of financial support for presidential candidates in California using an open dataset. We found that the Democratic candidates enjoy wide popularity in California, especially Bernard and Hillary. Although there are many Republican candidates, their combined support is less than that of either Bernard or Hillary.

Examining the support pattern by candidate within the Democratic party, we found that Bernard is more popular in Northern California and among lower-income contributors, while Hillary is more popular among upper-class contributors and in high-income neighborhoods. Candidate support within a party varies with supporters’ income level, zip code, and occupation.

Support patterns in California are closely related to income level, residential location, and occupation.


The Data Structure

The dataset records contributions from individuals in California to candidates in the presidential election. It is a tabular dataset containing 548166 observations of 18 variables, dated from Jan 1, 2014 to April 30, 2016. Each observation is a transaction in which a contributor supports a candidate. Each transaction therefore contains three kinds of information:

  • The contributor, including name, zip code, city, employer, occupation, and state;
  • The candidate, including the candidate’s name, committee ID, and candidate ID;
  • Transaction details, including receipt date, receipt amount, receipt description, memo code, memo text, form type, transaction ID, file number, and election type.

Exploring this rich dataset helps us understand the political landscape of the presidential election in California.


Main interests of Exploratory Research

FALSE 'data.frame': 548166 obs. of  18 variables:
FALSE  $ cmte_id          : Factor w/ 25 levels "C00458844","C00500587",..: 7 7 7 7 7 7 7 7 7 7 ...
FALSE  $ cand_id          : Factor w/ 25 levels "cand_id","P00003392",..: 13 13 13 13 13 13 13 13 13 13 ...
FALSE  $ cand_name        : Factor w/ 25 levels "Bush, Jeb","cand_nm",..: 20 20 20 20 20 20 20 20 20 20 ...
FALSE  $ contrb_nm        : Factor w/ 100017 levels "_BOOTH, ELAINE S.",..: 4881 12668 12678 12681 12694 899 5959 13198 13198 13198 ...
FALSE  $ contrb_city      : Factor w/ 1464 levels "","*MORENO VALLEY",..: 1138 1175 721 570 1184 249 1072 1179 1179 1179 ...
FALSE  $ contrb_st        : Factor w/ 2 levels "CA","contbr_st": 1 1 1 1 1 1 1 1 1 1 ...
FALSE  $ contrb_zip       : Factor w/ 85151 levels "","00000","000090272",..: 53916 41095 1547 33840 11562 23445 66550 71648 71648 71648 ...
FALSE  $ contrb_employer  : Factor w/ 34055 levels ""," APPLE INC.",..: 4793 18837 26433 4109 11817 9836 20673 24075 24075 24075 ...
FALSE  $ contrb_occupation: Factor w/ 15689 levels ""," REAL ESTATE BROKER",..: 11518 1989 15008 4683 11277 4683 9115 11855 11855 11855 ...
FALSE  $ contb_receipt_amt: Factor w/ 5870 levels "-.44","-.76",..: 599 599 3760 2425 3675 3984 4925 4415 1650 4415 ...
FALSE  $ contb_receipt_DT : Factor w/ 488 levels "01-APR-15","01-APR-16",..: 452 452 452 452 452 452 452 452 452 452 ...
FALSE  $ receipt_desc     : Factor w/ 66 levels "","* EARMARKED CONTRIBUTION: SEE BELOW REATTRIBUTION/REFUND PENDING",..: 1 1 1 1 1 1 1 1 1 1 ...
FALSE  $ memo_cd          : Factor w/ 3 levels "","memo_cd","X": 1 1 1 1 1 1 1 1 1 1 ...
FALSE  $ memo_text        : Factor w/ 250 levels "","*","* EARMARKED CONTRIBUTION: SEE BELOW",..: 3 3 3 3 1 3 3 3 3 3 ...
FALSE  $ from_tp          : Factor w/ 4 levels "form_tp","SA17A",..: 2 2 2 2 2 2 2 2 2 2 ...
FALSE  $ file_num         : Factor w/ 129 levels "1003942","1004025",..: 64 64 64 64 64 64 64 64 64 64 ...
FALSE  $ tran_id          : Factor w/ 545830 levels "A000771210424405B8CF",..: 353073 352207 347385 351166 323101 352521 348822 349557 349644 349911 ...
FALSE  $ election_tp      : Factor w/ 5 levels "","election_tp",..: 4 4 4 4 4 4 4 4 4 4 ...

Univariate Analysis

Key variables

  • The key variables are the transaction amount (contb_receipt_amt), the candidate name (cand_name), the contributor’s location by city (contrb_city) or zip code (contrb_zip), and the contributor’s occupation (contrb_occupation).
  • Other features, such as the candidate’s party (party), the date of the donation (contb_receipt_DT), and the contribution category by amount (contrb_category), might also be helpful.

Tidy and Clean the Data

Three changes are made to tidy and clean the data:

  • Change the format of contb_receipt_DT from factor to date.
  • Change the format of contb_receipt_amt from factor to character and then to numeric. There are some negative values, so the analysis should use only the positive values.
  • Extract the first five digits of the zip code using stringr to make the values consistent, since some zip codes have nine digits and some have five.
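The first two conversions can be sketched in base R as follows. This is only an illustration on made-up rows, not the code used for the report; the new column names date and amount are my own, and the date format string assumes values like "01-APR-15" and an English locale for month names.

```r
# Toy rows that mimic the raw factor columns (values are made up).
raw <- data.frame(
  contb_receipt_DT  = factor(c("01-APR-15", "15-MAR-16", "30-APR-16")),
  contb_receipt_amt = factor(c("27.44", "-50", "100"))
)

# Factor -> Date. "%b" matches the abbreviated month name and "%y" the
# two-digit year; month-name matching depends on the locale.
raw$date <- as.Date(as.character(raw$contb_receipt_DT), format = "%d-%b-%y")

# Factor -> character -> numeric. Calling as.numeric() directly on a factor
# would return the internal level codes instead of the amounts.
raw$amount <- as.numeric(as.character(raw$contb_receipt_amt))

# Keep only the positive amounts for the analysis.
positive <- raw[raw$amount > 0, ]
```

Going through as.character() first is the important step, because the factor codes look like plausible numbers and the mistake is easy to miss.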

Create New Variables

Two new variables are created.

  • I create a new variable called ‘party’ to help understand which party is more popular in California.
  • I also create a variable called ‘contrb_category’ with five levels: negative, 0-50, 51-200, 201-1000, and more than 1000.

Univariate Plots Section

Clean and Create New Variables

Creating new variable ‘party’

Computing the new field with a loop is too slow, so I subset the dataset into three parts, add the party variable to each part, and combine them with rbind.
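A minimal sketch of this step, on made-up rows. The candidate lists are partial, the placeholder candidate and the label "other" for the third subset are assumptions, and the names dem, gop, and other are hypothetical. The last lines show a vectorized alternative that gives the same labels without splitting the data.

```r
# Toy data; "Placeholder, Candidate" stands in for any candidate outside
# the two major parties (an assumption for illustration).
contrib <- data.frame(
  cand_name = c("Sanders, Bernard", "Clinton, Hillary Rodham",
                "Cruz, Rafael Edward 'Ted'", "Rubio, Marco",
                "Placeholder, Candidate"),
  stringsAsFactors = FALSE
)
democrats   <- c("Sanders, Bernard", "Clinton, Hillary Rodham")
republicans <- c("Cruz, Rafael Edward 'Ted'", "Rubio, Marco")  # partial list

# Subset into three parts, label each part, then stack them with rbind.
dem   <- contrib[contrib$cand_name %in% democrats, , drop = FALSE]
gop   <- contrib[contrib$cand_name %in% republicans, , drop = FALSE]
other <- contrib[!(contrib$cand_name %in% c(democrats, republicans)), , drop = FALSE]
dem$party   <- "democratic"
gop$party   <- "republican"
other$party <- "other"
combined <- rbind(dem, gop, other)

# Vectorized alternative: no loop, no subsetting, row order preserved.
contrib$party <- ifelse(contrib$cand_name %in% democrats, "democratic",
                 ifelse(contrib$cand_name %in% republicans, "republican", "other"))
```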

Tidy the variable of contribution amount

Step 1. Turn it from factor to numeric

FALSE      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
FALSE -10000.00     15.00     27.44    135.20    100.00  10800.00

Step 2. Plot the distribution of the contribution amount

Create a new variable called ‘contrb_category’

This variable groups the contribution amount into five categories: negative, 0-50, 51-200, 201-1000, and more_than_1000. We plot the number of contributions by category and find that most contributions are under 50 USD.
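One way to build such a variable is with cut(), sketched here on made-up amounts. The exact boundary handling is an assumption: the intervals are right-closed, so an amount of exactly 50 falls in "0-50", and an amount of exactly 0 would fall in the first bin.

```r
# Made-up amounts for illustration.
amount <- c(-25, 15, 27.44, 50, 100, 500, 2700)

# Right-closed bins: (-Inf, 0], (0, 50], (50, 200], (200, 1000], (1000, Inf).
contrb_category <- cut(
  amount,
  breaks = c(-Inf, 0, 50, 200, 1000, Inf),
  labels = c("negative", "0-50", "51-200", "201-1000", "more_than_1000")
)

# Number of contributions in each category.
table(contrb_category)
```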

[Figure: Count by Contribution Amount Bucket]

Clean the contrb_zip variable

Update the zipcode by only including the first five digits and find the top 10 neighborhoods by count.

FALSE 
FALSE 94110 94114 94611 90046 94117 95060 94109 90049 90069 90405 
FALSE  4506  3893  3234  3089  3031  2738  2349  2328  2325  2301
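The zip code cleanup and the count above can be sketched as follows. The sketch uses base substr() on made-up values; the report itself uses stringr, where str_sub(x, 1, 5) would play the same role.

```r
# Made-up zip codes mixing nine-digit and five-digit forms.
zip_raw <- c("941101234", "94110", "900461111", "94110", "90046", "95060")

# Keep only the first five digits.
zip5 <- substr(zip_raw, 1, 5)

# Count contributions per zip code and keep the ten most frequent.
top_zips <- head(sort(table(zip5), decreasing = TRUE), 10)
```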

Distribution of Key Variables

Candidates and Contribution Count

The top 10 candidates by popularity (Number of Supports)

FALSE 
FALSE          Sanders, Bernard   Clinton, Hillary Rodham 
FALSE                    309233                    125141 
FALSE Cruz, Rafael Edward 'Ted'       Carson, Benjamin S. 
FALSE                     53207                     27337 
FALSE              Rubio, Marco            Fiorina, Carly 
FALSE                     13755                      4696 
FALSE                Paul, Rand                 Bush, Jeb 
FALSE                      4255                      3109 
FALSE           Kasich, John R.          Trump, Donald J. 
FALSE                      2857                      1370

Contribution Count by City

Top 20 Cities by Number of Supports

FALSE 
FALSE   LOS ANGELES SAN FRANCISCO     SAN DIEGO       OAKLAND      SAN JOSE 
FALSE         40072         37034         19257         13507         13246 
FALSE      BERKELEY    SACRAMENTO  SANTA MONICA    LONG BEACH    SANTA CRUZ 
FALSE         10469          9207          5819          5758          5448 
FALSE SANTA BARBARA    SANTA ROSA      PASADENA     PALO ALTO        FRESNO 
FALSE          5404          5311          4862          4727          3920 
FALSE  WALNUT CREEK        IRVINE   BAKERSFIELD         DAVIS     SUNNYVALE 
FALSE          3521          3392          3119          3079          3008

Contribution Count by Occupation

Top 10 Occupations by Number of Supports

FALSE 
FALSE          NOT EMPLOYED               RETIRED              ATTORNEY 
FALSE                 83733                 81385                 12342 
FALSE               TEACHER              ENGINEER     SOFTWARE ENGINEER 
FALSE                 12126                  8500                  8106 
FALSE             HOMEMAKER             PHYSICIAN INFORMATION REQUESTED 
FALSE                  6642                  6505                  5403 
FALSE            CONSULTANT 
FALSE                  5237

Contribution Count over Time

Here we examine the contributions made after 2015.

FALSE         Min.      1st Qu.       Median         Mean      3rd Qu. 
FALSE "2013-11-05" "2015-12-31" "2016-02-29" "2016-02-01" "2016-03-31" 
FALSE         Max. 
FALSE "2016-04-30"

Number of Supports(Count) by Employer

These are the ten employers with the largest number of contributions. Because this variable is unclear and contains too many categories, I did not carry it into further analysis.

FALSE 
FALSE                                RETIRED 
FALSE                                  57203 
FALSE                                   NONE 
FALSE                                  50067 
FALSE                           NOT EMPLOYED 
FALSE                                  48536 
FALSE                                    N/A 
FALSE                                  31263 
FALSE                                   SELF 
FALSE                                  29351 
FALSE                          SELF EMPLOYED 
FALSE                                  28478 
FALSE                          SELF-EMPLOYED 
FALSE                                  26126 
FALSE                  INFORMATION REQUESTED 
FALSE                                   5308 
FALSE INFORMATION REQUESTED PER BEST EFFORTS 
FALSE                                   4056 
FALSE                                        
FALSE                                   3389

Bivariate Plots Section

Total Contribution by Candidate

Who are the 10 most popular candidates? How much did they receive in contributions? I created a new table of donations received per candidate, called receipt_by_candidate, and plotted the proportion of support count and the proportion of total contribution by candidate. We found that Hillary received the most funds, while Sanders received the largest number of individual contributions. Note that the bars are colored by a third variable, so these two charts are actually multivariate plots; I place them here so that the report is easier to follow.
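A table like receipt_by_candidate could be built roughly as below. This is a sketch on made-up rows with shortened candidate names; the column names count_share and total_share are hypothetical.

```r
# Made-up contributions for illustration.
contrib <- data.frame(
  cand_name = c("Sanders", "Sanders", "Sanders", "Clinton", "Clinton", "Cruz"),
  amount    = c(15, 27, 10, 500, 2700, 100),
  stringsAsFactors = FALSE
)

# Group the amounts by candidate, then compute count and total per group.
groups <- split(contrib$amount, contrib$cand_name)
n      <- lengths(groups)
total  <- vapply(groups, sum, numeric(1))

# Express both measures as proportions of the overall figure.
receipt_by_candidate <- data.frame(
  cand_name   = names(groups),
  count_share = as.vector(n / sum(n)),
  total_share = as.vector(total / sum(total)),
  stringsAsFactors = FALSE
)
```

Keeping the two shares side by side makes it easy to spot a candidate with many small contributions versus one with fewer, larger ones.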

Total contribution and contribution distribution by party

We plot the count of support and the distribution of contribution amount by party.
The average donation to Democratic candidates is far smaller than the average donation to Republican candidates.

Total contribution and contribution distribution by candidate

Plotting the donation distribution by candidate gives a similar finding: the Democratic candidates receive many contributions, but the amounts are relatively small.

Top contributions by city

  • Top ten cities by total contribution. We first select the cities with more than 100 transactions, then plot the total contribution for the top 10 of them.

  • Los Angeles and San Francisco are the cities that contribute the most.

  • The Top 10 Cities by Average Contribution

To answer the question “Where do the rich donors live?”, we look at the cities with the highest average contribution among those with more than 100 transactions. We found limited overlap between the top cities by total contribution and the rich cities with a high average contribution. Rich communities did not dominate presidential contributions in California.
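The selection described above (filter by transaction count, then rank by mean) can be sketched as a small helper. The function name top_cities_by_mean and its parameters are hypothetical; the toy call lowers the threshold so that the made-up data passes the filter.

```r
# Rank cities by mean contribution, keeping only cities with more than
# min_n transactions. Returns a named numeric vector of the top k means.
top_cities_by_mean <- function(city, amount, min_n = 100, k = 10) {
  groups <- split(amount, city)
  n      <- lengths(groups)
  avg    <- vapply(groups, mean, numeric(1))
  head(sort(avg[n > min_n], decreasing = TRUE), k)
}

# Made-up data: city "C" has a high mean but too few transactions.
city   <- c("A", "A", "A", "B", "B", "B", "C")
amount <- c(10, 20, 30, 500, 700, 900, 5000)
result <- top_cities_by_mean(city, amount, min_n = 2, k = 2)
```

The count filter is what keeps a single large donation in a small city from dominating the ranking.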

Total contribution by occupation

  • The retired and the unemployed contribute the most to presidential campaigns. Homemakers are also a very important political force, since their average donation is even higher than that of attorneys and is smaller only than that of CEOs and presidents.

  • Homemakers, attorneys, and IT workers rank highest by mean contribution.

Total Contribution by Zipcode

The top ten neighborhoods by total contribution

Contribution Counts and Total over Time by Party

  • The Democratic party dominates in terms of support counts, especially in 2016.
  • In terms of total amount, contributions to the two parties were similar in 2015, but in 2016 the Democratic party has a significant edge. This might be caused by the unexpected rise in popularity of Donald Trump and the dropouts of other Republican candidates.

Note that the plot of contributions over time is a multivariate plot, but I placed it here for ease of comparison.
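The counts and totals behind such a plot can be tabulated by month and party. The sketch below uses made-up rows, and the monthly grouping is an assumption about the time resolution.

```r
# Made-up contributions for illustration.
date   <- as.Date(c("2015-06-10", "2015-06-20", "2016-02-01", "2016-02-15", "2016-02-28"))
party  <- c("democratic", "republican", "democratic", "democratic", "republican")
amount <- c(50, 1000, 27, 15, 250)

# Collapse each date to its year and month.
month <- format(date, "%Y-%m")

# Number of contributions and total amount per month and party.
count_by_month <- table(month, party)
total_by_month <- tapply(amount, list(month, party), sum)
```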


Bivariate Analysis

1. Significant popularity of Democratic Candidates

Democratic supporters dominate California in terms of both support count and total contribution. However, the number of contributions and the total funds received by a candidate are not necessarily positively related in California.

Bernard and Hillary

More than half of the contributions by count are for Sanders, but Clinton raised the most money, about twice as much as Sanders. The mean donation varies considerably by candidate.

2. Key supportive forces by occupation.

The breakdown of supporters by occupation is interesting. The retired make up the largest group of supporters. The not-employed are the second largest, which is unexpected. Homemakers also carry considerable weight in the political landscape in California, as they contribute the fourth largest amount of money to candidates, with a very high average. It would be interesting to see whom they support.

3. Number of Contributions over time by party

Democratic candidates receive an increasing number of contributions over time, while support for Republican candidates seems to decrease in 2016. This might be due to the dropout of many Republican candidates and the unexpected rise of Trump.


Multivariate Plots Section

Total Contribution by Candidates and Contribution Category

Since the Democratic party has significant popularity in California, and Hillary and Bernard are the two most popular candidates, it is useful to create a new variable called cand_name2 with three groups: Hillary, Bernard, and the Republican candidates.
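A sketch of how cand_name2 could be derived and cross-tabulated with the contribution category, on made-up rows. Leaving candidates outside the three groups as NA is an assumption for illustration.

```r
# Made-up rows; party and contrb_category are assumed to exist already.
contrib <- data.frame(
  cand_name = c("Sanders, Bernard", "Sanders, Bernard",
                "Clinton, Hillary Rodham", "Clinton, Hillary Rodham",
                "Rubio, Marco"),
  party = c("democratic", "democratic", "democratic", "democratic", "republican"),
  contrb_category = c("0-50", "0-50", "0-50", "more_than_1000", "201-1000"),
  stringsAsFactors = FALSE
)

# Bernard and Hillary keep their own group; all Republican candidates share one.
contrib$cand_name2 <- ifelse(
  contrib$cand_name == "Sanders, Bernard", "Bernard",
  ifelse(contrib$cand_name == "Clinton, Hillary Rodham", "Hillary",
         ifelse(contrib$party == "republican", "Republican", NA))
)

# Number of contributions per group and contribution category.
counts <- table(contrib$cand_name2, contrib$contrb_category)
```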

As shown in the following plot, Bernard received support mostly from small contributions, while Hillary seems to be popular across different groups. She also received many contributions of over 1000 USD.

Support Pattern by Occupation

The retired and homemakers are more likely to support Republican candidates. They are also more likely to support Hillary than Bernard. The not-employed support only Democratic candidates, and most of them support Bernard.

Support pattern in the top and bottom cities by average contribution

Both high- and low-contribution communities have diverse political voices. However, high-income communities are more likely to support Republican candidates or Hillary, while Sanders has larger support in poorer communities.

Geospatial Analysis

Even though we analyzed the support pattern in rich and poor cities, it is still hard to see the full picture. This part explores the spatial pattern of candidate support. We create a new dataset, Bernard_index, that maps the support rate of Bernard, Hillary, and the Republican candidates, as well as the total contribution, in each zip code area. I linked this table with the geographic information table of California. Two maps are created:

  • The Bernard_index map shows the support rate of Bernard (number of Bernard contributions / total number of contributions) in each zip code; the redder the area, the higher the support rate of Bernard.

  • The Bernard index and contribution size map plots a point for each zip code, where the size indicates the total contributions and the color indicates the Bernard support rate.

From these two maps we can see that support for Bernard is overwhelming in California, especially in Northern California and the Bay Area.

Create a new table called zip_summary which includes a Bernard_index
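A minimal sketch of such a table, using the definition given above (number of Bernard contributions divided by the total number of contributions in the zip code). The rows are made up, and the column names zip5, n_contrib, and total_amount are hypothetical.

```r
# Made-up contributions for illustration.
contrib <- data.frame(
  zip5      = c("94110", "94110", "94110", "90210", "90210"),
  cand_name = c("Sanders, Bernard", "Sanders, Bernard", "Clinton, Hillary Rodham",
                "Clinton, Hillary Rodham", "Sanders, Bernard"),
  amount    = c(15, 27, 500, 2700, 50),
  stringsAsFactors = FALSE
)

# One summary row per zip code.
zip_summary <- do.call(rbind, lapply(split(contrib, contrib$zip5), function(d) {
  data.frame(
    zip5          = d$zip5[1],
    n_contrib     = nrow(d),
    total_amount  = sum(d$amount),
    # Share of contributions in this zip code that went to Bernard.
    Bernard_index = mean(d$cand_name == "Sanders, Bernard"),
    stringsAsFactors = FALSE
  )
}))
```

The index is a share of contribution counts, not of dollars, so a zip code with a few large Hillary donations can still have a high Bernard_index.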

Plot the Bernard Index by point

Plot the Bernard Index by polygon

FALSE OGR data source with driver: ESRI Shapefile 
FALSE Source: "cb_2014_us_zcta510_500k", layer: "cb_2014_us_zcta510_500k"
FALSE with 33144 features
FALSE It has 5 fields

Multivariate Analysis

Candidates’ Popularity varies by geography and occupation

The multivariate analysis confirms again that candidates seem to have different levels of popularity in different income groups.

  • The affluent are more likely to support the Republican candidates, and when they support a Democratic candidate, they are more likely to support Hillary. This is reflected in the support pattern by geography.
  • The retired, homemakers, and attorneys are more likely to support Republican candidates or Hillary, while the unemployed are more likely to support Bernard.

Classification Model

It is also possible to construct a classification model (using logistic regression, random forest, gradient boosting, SVM, or a neural network) to predict the candidate based on the amount of the contribution, the zip code, and the contributor’s occupation. I did not try any model here, but would be interested in exploring these options later.


Key Plots and Summary

1. Candidates by Popularity

Democratic candidates have significant popularity in California. While Bernard received the most support by transaction count, Hillary received the most support in terms of total contribution.

2. Support pattern by Candidate

The affluent are more likely to support the Republican candidates, and when they support a Democratic candidate, they are more likely to support Hillary. Bernard gains support mostly from small contributions. The support pattern across cities with different levels of average contribution supports this finding as well.

3. Support Pattern by Region

Most contributions are concentrated in the metropolitan areas of California, especially the Bay Area and the Los Angeles metropolitan area. Although support for Bernard is overwhelming across California, his popularity is more pronounced in Northern California than in Central and Southern California.


Reflection

Summary

This analysis examines the support pattern in California. We found that Democratic candidates have wide popularity in California, especially Bernard and Hillary. Although there are many Republican candidates, their combined support is still less than that of either Bernard or Hillary.

Examining the support pattern by candidate within the Democratic party, we found that Bernard is more popular in Northern California and among lower-income contributors, while Hillary is more popular among upper-class contributors and in high-income neighborhoods. Candidate support within a party varies with supporters’ income level, zip code, and occupation.

Struggle

In my exploratory analysis, I struggled most with two things:

  • How to subset the data. The geographic variables, such as zip code and city, have so many levels that readers can process only a limited number of them in a plot, so I struggled with whether to choose 5, 10, or 20 of them to showcase the relationship. In the end, I chose 20.
  • How to recategorize the data. I found that party is too coarse a category and candidate is too fine a category. In the end, I created a category that separates Hillary from Bernard and places them at the same level as the Republican candidates. Although this seems illogical, I found that the category helps tell much of the story. I also feel the contribution amount category is very helpful and connects the whole story together, as it clearly marks out the difference between the supporters of Hillary and Bernard.

Future Works

As California has long been known as a Democratic state, it would be interesting to compare its pattern of presidential contributions with the national average. It is also possible to construct a classification model to predict the candidate based on the amount of the contribution, the zip code, and the contributor’s occupation.